Abstract

I briefly report on some unexpected results that I obtained when optimizing the model
parameters of the Lasso. In simulations with varying observations-to-variables ratio n/p,
I typically observe a strong peak in the test error curve at the transition point n/p = 1.
This peaking phenomenon is well-documented in scenarios that involve the inversion of
the sample covariance matrix, and as I illustrate in this note, it is also the source of the
peak for the Lasso. The key problem is the parametrization of the Lasso penalty, as used e.g.
in the current R package lars, and I present a solution in terms of a normalized Lasso
parameter.



1 Introduction 



In regression and classification, an omnipresent challenge is the correct prediction in the
presence of a huge number p of variables based on a small number n of observations, and for
any regularized method, one typically expects the performance to increase with increasing
observations-to-variables ratio n/p. While this is true in the regions n > p and n < p, some
estimators exhibit a peaking behavior for n = p, leading to particularly low performance.
As documented in the literature (Raudys and Duin, 1998), this affects all methods that
use the (Moore-Penrose) inverse of the sample covariance matrix (see Section 3 for more
details). This leads e.g. to the peculiar effect that for Linear Discriminant Analysis, the
performance improves in the n = p case if a set of uninformative variables is added to the
model. In this note, I show that this peaking phenomenon can also occur in scenarios where
the Moore-Penrose inverse is not directly used for computing the model, but where
least-squares estimates are used for model selection. One particularly popular method is the
Lasso (Tibshirani, 1996) and its current implementation in the software R. As illustrated in
Section 2, its parameterization of the penalty term in terms of the ratio of the $\ell_1$-norm of the
Lasso solution and the $\ell_1$-norm of the least-squares solution leads to problems when using
cross-validation for model selection. I present a solution in terms of a normalized penalty term.

2 Simulation Setting and Peaking Phenomenon



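For a p-dimensional linear regression model

$$ y = x^\top \beta + \varepsilon, $$

the task is to estimate $\beta$ based on n observations $\{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathbb{R}^p \times \mathbb{R}$. As
usual, the centered and scaled observations are pooled into $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times p}$ and
$y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$.

In this note, I study the performance of the Lasso (Tibshirani, 1996)

$$ \hat\beta_{\text{lasso}} = \arg\min_{\beta} \left\{ \|y - X\beta\|^2 + \lambda \|\beta\|_1 \right\}, \quad \lambda \geq 0, $$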

for a fixed dimensionality p and for a varying number n of observations. Common sense tells 
us that the test error is approximately a decreasing function of the observations-to-variables 
ratio n/p. However, in several empirical studies, I observe particularly poor results for the 
Lasso in the transition case n/p = 1, leading to a prominent peak in the test error curve at 
n = p. 

In the remainder of this section, I illustrate this unexpected behavior on a synthetic data
set. I would like to stress that the peaking behavior is not due to particular choices in the
simulation setup, but only depends on the ratio n/p. I generate $n_{\text{total}} = 5000$ observations
$x_i \in \mathbb{R}^{90}$, where $x_i$ is drawn from a multivariate normal distribution with no collinearity.
Out of the p = 90 true regression coefficients $\beta$, a random subset of size 20 is non-zero and
drawn from a univariate distribution on [-4, +4]. The error term $\varepsilon$ is normally distributed
with variance such that the signal-to-noise ratio is equal to 4. For the simulation, I sub-sample
training sets of sizes n = 10, 20, . . . , 190, 200. The sub-sampling is repeated 10 times. On the
training set of size n, the optimal amount of penalization is chosen via 10-fold cross-validation.
The Lasso solution is then computed on the whole training set of size n, and the performance
is evaluated by computing the mean squared error on an additional test set of size 500.
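
The data-generating part of this setup can be sketched in Python as follows. This is only an illustrative sketch and not the code used for the experiments, which rely on the R package lars. Several details are assumptions that the text does not specify: the helper name simulate_data is hypothetical, the non-zero coefficients are drawn uniformly on [-4, +4], the predictors are independent standard normal, and the signal-to-noise ratio is taken to be the variance of the signal divided by the variance of the noise.

```python
import numpy as np

def simulate_data(n_total=5000, p=90, n_nonzero=20, snr=4.0, seed=0):
    """Draw (X, y, beta) from a sparse linear model y = X beta + noise."""
    rng = np.random.default_rng(seed)
    # Independent standard normal predictors, i.e. no collinearity.
    X = rng.standard_normal((n_total, p))
    # Random support of size 20; the uniform distribution is an assumption.
    beta = np.zeros(p)
    support = rng.choice(p, size=n_nonzero, replace=False)
    beta[support] = rng.uniform(-4.0, 4.0, size=n_nonzero)
    signal = X @ beta
    # Assumed SNR definition: var(signal) / var(noise) = snr.
    noise_sd = np.sqrt(signal.var() / snr)
    y = signal + noise_sd * rng.standard_normal(n_total)
    return X, y, beta

X, y, beta = simulate_data()
train_sizes = list(range(10, 201, 10))  # n = 10, 20, ..., 190, 200
```

Training sets of the sizes in train_sizes would then be sub-sampled from the rows of X and y, with a disjoint set of 500 rows held out for testing.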



I use the cv.lars function of the R package lars, version 0.9-7 (Hastie and Efron, 2007),
to perform the experiments. The mean test error over the 10 runs is displayed in the left
panel of Figure 1. As expected, the test error decreases with the number of observations.
For n = p however, there is a striking peak in the test error (marked by the letter X), and
the performance is much worse compared to the seemingly more complex scenario of $n \ll p$.
We also observe the peaking behavior in the case where n = p in the cross-validation split
(marked by the letter O). The right panel of Figure 1 displays the cross-validated penalty
term of the Lasso as a function of n. Note that in the cv.lars function, the amount of
penalization is not parameterized by $\lambda \in [0, \infty[$ but by the more convenient quantity

$$ s = \frac{\|\hat\beta_{\text{lasso}}\|_1}{\|\hat\beta_{\text{ols}}\|_1} \in [0, 1]. \qquad (1) $$

Values of s close to 0 correspond to a high value of $\lambda$, and hence to a large amount of
penalization. The right panel of Figure 1 shows that the peaking behavior also occurs for
the amount of penalization, measured by s. Interestingly, the peak does not occur for n = p,
but in the case where the number of observations equals the number of variables in the
cross-validation loops. This peculiar behavior is explained in the two following sections, and I also
present a normalization procedure that solves this problem.

Figure 1: Performance of the Lasso as a function of the number of observations. Left: test
error. Right: penalty term s as defined in Equation (1).

3 The Pseudo-Inverse of the Covariance Matrix 



It has been reported in the literature (Raudys and Duin, 1998; Tresp, 2002; Opper, 2001)
that the pseudo-inverse of the empirical covariance matrix $\hat\Sigma$
is a particularly bad estimate for the true precision matrix $\Sigma^{-1}$ in the case p = n. The
rationale behind this effect is as follows. The Moore-Penrose inverse of the empirical covariance
matrix is

$$ \hat\Sigma^{+} = \sum_{i=1}^{\operatorname{rank}(\hat\Sigma)} \frac{1}{\mu_i} \, u_i u_i^\top, $$

where $\mu_i$ denote the non-zero eigenvalues of $\hat\Sigma$ and $u_i$ the corresponding eigenvectors.
In particular, in the small sample case, the smallest p - n eigenvalues of the Moore-Penrose
inverse are set to 0. This corresponds to cutting off directions with high frequency. While this
introduces an additional bias, it tends to avoid the huge amount of variance that is due to
the inversion of small but non-zero eigenvalues. In the transition case n/p = 1, all eigenvalues
are $\neq 0$ (with some of them very small) and the MSE is most prominent in this situation.
The striking peaking behavior for n = p is illustrated in e.g. Schafer and Strimmer (2005).
As a consequence, any statistical method that uses the pseudo-inverse of the covariance suffers
from the peaking phenomenon.
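
The role of the small non-zero eigenvalues can be checked numerically. The following Python sketch is an illustration under simplifying assumptions and not part of the original experiments: the data are uncentered standard normal, the covariance estimate is taken as $X^\top X / n$, the dimension p = 30 is chosen for speed, and the helper name median_largest_pinv_eigenvalue is hypothetical. It reports the largest eigenvalue of the pseudo-inverse, which is the reciprocal of the smallest non-zero eigenvalue of the covariance estimate.

```python
import numpy as np

def median_largest_pinv_eigenvalue(n, p, reps, rng):
    """Median (over reps) of the largest eigenvalue of the pseudo-inverse of X^T X / n."""
    values = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        # The non-zero eigenvalues of X^T X / n are the squared singular values of X over n.
        eigenvalues = np.linalg.svd(X, compute_uv=False) ** 2 / n
        # The pseudo-inverse inverts exactly these min(n, p) non-zero eigenvalues.
        values.append(1.0 / eigenvalues.min())
    return float(np.median(values))

rng = np.random.default_rng(0)
p = 30
peak = {n: median_largest_pinv_eigenvalue(n, p, reps=20, rng=rng) for n in (15, 30, 60)}
for n, value in peak.items():
    print(f"n = {n:2d}: largest eigenvalue of the pseudo-inverse = {value:.1f}")
```

In this sketch the value at n = p should exceed the values at n < p and n > p by orders of magnitude, which is the variance inflation described above.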

In particular, the peaking behavior also occurs in ordinary least squares regression, as it
uses the pseudo-inverse,

$$ \hat\beta_{\text{ols}} = (X^\top X)^{+} X^\top y. $$



Figure 2: Peaking behavior of the ordinary least squares regression: $\ell_1$-norm of the least
squares estimate as a function of the number of observations.

This is illustrated in Figure 2. On the training data of size n = 10, 20, . . . , 200, I compute
the $\ell_1$-norm of the least squares estimate. The figure displays the mean norm over all 10 runs.
For n = p, the norm is particularly high. Note furthermore that except for n = p, the curve
is rather smooth, and small changes in the number of observations only lead to small changes
in the $\ell_1$-norm of the estimate.
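
A small-scale version of this computation can be sketched in Python. The sketch is illustrative only and rests on assumptions that differ from the setup of Section 2: p = 30 instead of 90, a hypothetical coefficient vector with 8 non-zero entries, a fixed noise standard deviation of 3, and freshly drawn data instead of sub-samples. The helper name mean_l1_norm_ols is hypothetical as well.

```python
import numpy as np

def mean_l1_norm_ols(n, p, beta, noise_sd, reps, rng):
    """Mean l1-norm of the pseudo-inverse least squares estimate over reps draws."""
    norms = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = X @ beta + noise_sd * rng.standard_normal(n)
        # pinv(X) @ y equals (X^T X)^+ X^T y, the pseudo-inverse solution.
        beta_ols = np.linalg.pinv(X) @ y
        norms.append(np.abs(beta_ols).sum())
    return float(np.mean(norms))

rng = np.random.default_rng(0)
p = 30
beta = np.zeros(p)
beta[:8] = rng.uniform(-4.0, 4.0, size=8)  # hypothetical sparse coefficients

l1_norms = {n: mean_l1_norm_ols(n, p, beta, noise_sd=3.0, reps=20, rng=rng)
            for n in range(10, 61, 10)}
for n, value in l1_norms.items():
    print(f"n = {n:2d}: mean l1-norm = {value:.1f}")
```

Under these assumptions, the largest mean norm should appear at n = p = 30, with a comparatively smooth curve on either side.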

This observation is the key to understanding the peaking behavior of the Lasso. While
for the estimation of the Lasso coefficients itself, the pseudo-inverse of the covariance matrix
does not occur, it is used for model selection, via the regularization parameter s defined in
Equation (1). I elaborate on this in the next section.

4 Normalization of the Lasso Penalty 

Let me denote by $n_{cv}$ the number of observations in the k cross-validation splits, and by $s_{n,cv}$
the optimal parameter chosen via cross-validation. As $n \approx n_{cv}$, one expects the MSE-optimal
coefficients $\hat\beta_{\text{lasso},n}$ computed on a set of size n and the MSE-optimal coefficients $\hat\beta_{\text{lasso},n_{cv}}$
based on a set of size $n_{cv}$ to be similar, i.e.

$$ n \approx n_{cv} \;\Rightarrow\; \|\hat\beta_{\text{lasso},n}\|_1 \approx \|\hat\beta_{\text{lasso},n_{cv}}\|_1. $$

Now, if $n_{cv} = p$, then, in each of the k cross-validation splits, the number of observations
equals the number of dimensions. As the least squares estimate is prone to the peaking
behavior (recall Figure 2), we observe

$$ \|\hat\beta_{\text{ols},n}\|_1 \ll \|\hat\beta_{\text{ols},n_{cv}}\|_1. $$

This implies that even though the $\ell_1$-norms of the regression coefficients $\hat\beta_{\text{lasso}}$ are almost
the same, their corresponding values of s differ drastically. To put it the other way around:
the optimal s found on the cross-validation splits (where $n_{cv} = p$) is way too small, and it
dramatically overestimates the amount of penalization. This explains the high test error in
the case $n_{cv} = p$ that is indicated by the letter O in Figure 1.

For n = p, the same argument applies. The optimal $s_{n,cv}$ on the cross-validation splits
(where $n_{cv} < p$) underestimates the amount of complexity in the n = p case, which leads to
the peak indicated by the letter X in Figure 1.

To illustrate that the peaking problem is indeed due to the parametrization (1), I normalize
the scaling parameter s in the following way. Let me denote by $\ell_{1,\text{ols}}^{cv}$ the average over all
k different $\ell_1$-norms of the least squares estimates obtained on the k cross-validation splits.
Furthermore, $\ell_{1,\text{ols}}$ is the $\ell_1$-norm of the least squares estimate on the complete training data
of size n. The normalized regularization parameter is

$$ \tilde s = \frac{\ell_{1,\text{ols}}^{cv}}{\ell_{1,\text{ols}}} \, s. \qquad (2) $$

Note that the function lars returns the least squares solution, hence there are no additional
computational costs.
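
The effect of the normalization can be made concrete with a small worked example in Python. The numbers are hypothetical and chosen only for illustration, and the function name normalized_fraction is not part of lars or parcor. The sketch assumes that the cross-validated fraction s is rescaled by the ratio of the average least squares norm on the splits to the least squares norm on the full training set, so that the implied $\ell_1$-norm of the Lasso solution is the same on both.

```python
def normalized_fraction(s_cv, l1_ols_cv_splits, l1_ols_full):
    """Rescale the cross-validated fraction s by the ratio of least squares l1-norms."""
    # Average l1-norm of the least squares estimates over the k cross-validation splits.
    l1_ols_cv = sum(l1_ols_cv_splits) / len(l1_ols_cv_splits)
    return s_cv * l1_ols_cv / l1_ols_full

# Hypothetical values: the splits sit at the peak (n_cv = p), the full set does not.
l1_ols_cv_splits = [180.0, 220.0, 200.0]  # average 200
l1_ols_full = 40.0
s_cv = 0.1                                # implies a Lasso l1-norm of 0.1 * 200 = 20

s_tilde = normalized_fraction(s_cv, l1_ols_cv_splits, l1_ols_full)
lasso_norm_full = s_tilde * l1_ols_full   # l1-norm implied on the full training set
print(s_tilde, lasso_norm_full)
```

Without the rescaling, applying s = 0.1 directly on the full training set would imply a Lasso $\ell_1$-norm of only 4 instead of 20, i.e. far too much penalization.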

To illustrate the effectiveness of the normalization, I re-run the simulation experiments
with cross-validation based on the normalized penalty parameter. This function, called
mylars, is implemented in the R package parcor version 0.1 (Kramer and Schafer, 2009).
The results together with the results for the un-normalized parameter (1) are displayed in
Figure 3.

Figure 3: Performance of the Lasso (black solid line) and the normalized Lasso (blue jagged
line) as a function of the number of observations. Left: test error. Right: penalty term $s$
(black solid line) and $\tilde s$ (blue jagged line) as defined in Equations (1) and (2), respectively.



5 Conclusion 



The peaking phenomenon is well-documented in the literature, and it affects every estimator
that uses the pseudo-inverse of the sample covariance matrix. As I illustrate in this note, this
defect at the transition point n/p = 1 can also occur in more subtle ways. For the Lasso, the
particular parameterization of the penalty term uses least-squares estimates, and it leads to
difficulties in model selection. One can expect similar problems if one e.g. measures the fit of
a model in terms of the total variance that it explains, and if the total variance is estimated
using least squares. In this case, a normalization as proposed above is advisable.



Acknowledgements 

I observed the peaking phenomenon during the preparation of a paper with Juliane Schafer
and Anne-Laure Boulesteix on regularized estimation of Gaussian graphical models (Kramer
et al., 2009). Together with Lukas Meier, the three of us discussed the source of the peaking
phenomenon in great detail. My colleagues Ryota Tomioka, Gilles Blanchard and Benjamin
Blankertz provided additional material to the discussion and pointed to relevant literature.



References 

Hastie, T. and Efron, B. (2007). lars: Least Angle Regression, Lasso and Forward Stagewise. 
R package version 0.9-7. 

Kramer, N. and Schafer, J. (2009). parcor: estimation of partial correlations based on
regularized regression. R package version 0.1.

Kramer, N., Schafer, J., and Boulesteix, A.-L. (2009). Regularized estimation of large-scale
gene regulatory networks using graphical Gaussian models. Preprint.

Opper, M. (2001). Learning to Generalize. Academic Press, pages 763-775. 

Raudys, S. and Duin, R. (1998). Expected classification error of the Fisher linear classifier 
with pseudo-inverse covariance matrix. Pattern Recognition Letters, 19(5-6):385-392. 

Schafer, J. and Strimmer, K. (2005). An empirical Bayes approach to inferring large-scale
gene association networks. Bioinformatics, 21(6):754-764.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal 
Statistical Society Series B, 58:267-288. 

Tresp, V. (2002). The Equivalence between Row and Column Linear Regression. Technical 
Report. 






